A proposal for shared memory based backup infrastructure

  • bharath.rupireddyforpostgres@gmail.com (2022-07-23T09:28:53+00:00)
    Hi,

    Right now, the session that starts a backup with pg_backup_start() has to end it with pg_backup_stop(), which returns the backup_label and tablespace_map contents (commit 39969e2a1). If backups are taken using custom disk snapshot tools on production servers, the high-level steps are:

    1) open a session
    2) run pg_backup_start() using the session opened in (1)
    3) run the custom disk snapshot tools, which will often copy the entire data directory over the network
    4) run pg_backup_stop() using the session opened in (1)

    Typically, step (3) takes a good amount of time in production environments with terabytes or petabytes of data, and keeping the session alive from step (1) to (4) has overhead and wastes resources. The session can also get closed for various reasons - idle session timeout, TCP/IP keepalive timeout, network problems, etc. All of these can render the backup useless.

    What if the backup started by one session could also be closed by another session? This seems achievable if we can place the backup_label, tablespace_map and other required session/backend-level contents in shared memory, keyed by the backup label name. It's a long way to go; the idea may be naive at this stage, and there might be something important that rules out the proposed solution. I would like to hear more thoughts from the hackers. (A sketch of today's single-session flow follows below.)

    Thanks to Sameer and Satya (cc-ed) for the offlist discussion.

    Regards,
    Bharath Rupireddy.
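    For reference, a minimal sketch of the current single-session flow, assuming the PostgreSQL 15 function signatures; the label 'nightly' is just an example:

        -- Session 1: start a non-exclusive backup (PG15 signature:
        -- pg_backup_start(label text, fast boolean) returns pg_lsn).
        SELECT pg_backup_start('nightly', false);

        -- Step (3): run the disk snapshot / copy the data directory
        -- while this same session is kept open.

        -- Session 1, later: stop the backup. The returned labelfile and
        -- spcmapfile contents must be saved alongside the backup.
        SELECT lsn, labelfile, spcmapfile FROM pg_backup_stop(true);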
    • mahendrakarforpg@gmail.com (2022-07-25T04:33:03+00:00)
      Hi Bharath, *"Typically, step (3) takes a good amount of time in productionenvironments with terabytes or petabytes scale of data and keeping thesession alive from step (1) to (4) has overhead and it wastes theresources. And the session can get closed for various reasons - idlein session timeout, tcp/ip keepalive timeout, network problems etc.All of these can render the backup useless."* >> this could be a common scenario and needs to be addressed. *"What if the backup started by a session can also be closed by anothersession? This seems to be achievable, if we can place thebackup_label, tablespace_map and other required session/backend levelcontents in shared memory with the key as backup_label name. It's along way to go."* *>> * I think storing metadata about backup of a session in shared memory may not work as it gets purged when the database goes for restart. We might require a separate catalogue table to handle the backup session. Thanks, Mahendrakar. On Sat, 23 Jul 2022 at 14:59, Bharath Rupireddy < bharath.rupireddyforpostgres@gmail.com> wrote: > Hi, > > Right now, the session that starts the backup with pg_backup_start() > has to end it with pg_backup_stop() which returns the backup_label and > tablespace_map contents (commit 39969e2a1). If the backups were to be > taken using custom disk snapshot tools on production servers, > following are the high-level steps involved: > 1) open a session > 2) run pg_backup_start() using the same session opened in (1) > 3) run custom disk snapshot tools which may, many-a-times, will copy > the entire data directory over the network > 4) run pg_backup_stop() using the same session opened in (1) > > Typically, step (3) takes a good amount of time in production > environments with terabytes or petabytes scale of data and keeping the > session alive from step (1) to (4) has overhead and it wastes the > resources. And the session can get closed for various reasons - idle > in session timeout, tcp/ip keepalive timeout, network problems etc. > All of these can render the backup useless. > > What if the backup started by a session can also be closed by another > session? This seems to be achievable, if we can place the > backup_label, tablespace_map and other required session/backend level > contents in shared memory with the key as backup_label name. It's a > long way to go. The idea may be naive at this stage and there might be > something important that doesn't let us do the proposed solution. I > would like to hear more thoughts from the hackers. > > Thanks to Sameer, Satya (cc-ed) for the offlist discussion. > > Regards, > Bharath Rupireddy. > > >
      • bharath.rupireddyforpostgres@gmail.com (2022-07-25T06:29:50+00:00)
        On Mon, Jul 25, 2022 at 10:03 AM mahendrakar s <mahendrakarforpg@gmail.com> wrote:
        >
        > Hi Bharath,

        Thanks, Mahendrakar, for taking a look at the design.

        > >> This could be a common scenario and needs to be addressed.

        Hm. Additionally, the problem of keeping the session that starts the backup open until the entire data directory is backed up becomes more worrisome when running backups for a huge number of servers at scale: the entity (a control plane or whatever) responsible for taking backups across a huge fleet of postgres production servers wastes a tremendous amount of resources keeping the backup sessions active until the actual backups finish.

        > >> I think storing a session's backup metadata in shared memory may not work, as it gets purged when the database restarts. We might require a separate catalogue table to handle the backup session.

        Right now, a non-exclusive backup (and exclusive backups are gone as of postgres 15) becomes useless anyway if postgres restarts, because the running backup state (the backup_label and tablespace_map contents) is not persisted.

        A few more thoughts on the shared memory based backups proposed in this thread (a hypothetical sketch of the resulting flow follows below):

        1) How many concurrent backups do we want to allow? Right now there is no limit; I believe max_connections concurrent backups can be taken - we have XLogCtlInsert->runningBackups but no limit. If we were to use shared memory to track backup state, we might have to decide on a maximum backup limit so as not to preallocate and consume shared memory unnecessarily; otherwise, we could use something like a dynamic shared memory hash table to store the backup state.

        2) How do we deal with backups that are started but never stopped? Basically, when do we declare a backup dead or expired? Perhaps we can have a maximum time limit after which a backup with no corresponding pg_backup_stop() is marked dead or expired.

        We may not want to dwell on the above points until the idea in general shows some benefit over the current backup infrastructure.

        Regards,
        Bharath Rupireddy.
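        To make the proposal concrete, a hypothetical sketch of the cross-session flow; a label argument to pg_backup_stop() does not exist today and is purely an assumption of this proposal:

            -- Hypothetical: none of the signatures below exist in PostgreSQL today.
            -- Session A starts the backup; state lives in shared memory keyed
            -- by the label instead of in the backend's local memory.
            SELECT pg_backup_start('nightly-2022-07-25', false);
            -- Session A may now disconnect while the snapshot tool runs.

            -- Session B, later: stop the backup by label and receive the
            -- backup_label and tablespace_map contents.
            SELECT lsn, labelfile, spcmapfile
            FROM pg_backup_stop('nightly-2022-07-25');
            -- Point (2) above: a label with no matching stop call within some
            -- timeout could be marked dead/expired and garbage-collected.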
        • mahendrakarforpg@gmail.com (2022-07-30T06:53:47+00:00)
          Hi Bharath,

          There might be security concerns if a backup started by one user can be stopped by another user, because the user who stops the backup will get the backup_label or tablespace_map file contents of the other user. Isn't this a concern for non-exclusive backups?

          I think there should be role-based control for backup-related activity, which can prevent unprivileged users from stopping the backup. Thoughts?

          Thanks,
          Mahendrakar.
          • robertmhaas@gmail.com (2022-08-05T15:55:33+00:00)
            On Sat, Jul 30, 2022 at 2:54 AM mahendrakar s <mahendrakarforpg@gmail.com> wrote:
            > There might be security concerns if a backup started by one user can be stopped by another user,
            > because the user who stops the backup will get the backup_label or tablespace_map file contents
            > of the other user. Isn't this a concern for non-exclusive backups?

            This doesn't seem like a real problem. If you can take a backup, you're already a highly-privileged user.

            --
            Robert Haas
            EDB: http://www.enterprisedb.com
          • bharath.rupireddyforpostgres@gmail.com (2022-08-04T07:18:59+00:00)
            On Sat, Jul 30, 2022 at 12:23 PM mahendrakar s <mahendrakarforpg@gmail.com> wrote:
            >
            > There might be security concerns if a backup started by one user can be stopped by another user,
            > because the user who stops the backup will get the backup_label or tablespace_map file contents
            > of the other user. Isn't this a concern for non-exclusive backups?
            >
            > I think there should be role-based control for backup-related activity, which can prevent
            > unprivileged users from stopping the backup.
            >
            > Thoughts?

            The pg_backup_start() and pg_backup_stop() functions are already role-based: they are restricted to superusers by default, but other users can be granted EXECUTE to run them (see the example below). I think the existing behaviour would suffice. However, with the new shared memory based backups, the responsibility for not letting users stop backups started by other users (yes, just with the label name) lies with those who use these functions: whoever starts a backup should be the one to stop it. Perhaps we can call that out explicitly in the documentation.

            --
            Bharath Rupireddy
            RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
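            For reference, the existing role-based control works through EXECUTE privileges; the role name backup_operator below is an illustrative assumption, and the signatures are as of PostgreSQL 15:

                -- 'backup_operator' is an example name, not a built-in role.
                CREATE ROLE backup_operator LOGIN;
                GRANT EXECUTE ON FUNCTION pg_backup_start(text, boolean) TO backup_operator;
                GRANT EXECUTE ON FUNCTION pg_backup_stop(boolean) TO backup_operator;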